Skip to content

revert #16663, which was a revert of #16039 #16675

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 13, 2017
Merged

revert #16663, which was a revert of #16039 #16675

merged 7 commits into from
Jun 13, 2017

Conversation

dgwynne
Copy link
Contributor

@dgwynne dgwynne commented Jun 12, 2017

i believe the issue with #16039 was the loss of the semantics provided by the "b" part of the mode in fopen(). this is a nop on unix-like systems, but appears to be significant on windows.

the equivalent functionality on windows in open() is the O_BINARY flag. this adds that to the open() call, and provides compat for it on platforms without the flag by defining it to 0.

while here, fix an fd leak in new_file_source() if the buffer allocation fails.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
  • whatsnew entry

dgwynne added 3 commits June 12, 2017 11:35
on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.
@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2017

You need to re-add the test that I wrote to ensure that your change is correct.

@dgwynne
Copy link
Contributor Author

dgwynne commented Jun 12, 2017

i did think it was going too well...

@codecov
Copy link

codecov bot commented Jun 12, 2017

Codecov Report

Merging #16675 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16675      +/-   ##
==========================================
- Coverage   90.92%   90.92%   -0.01%     
==========================================
  Files         161      161              
  Lines       49272    49271       -1     
==========================================
- Hits        44803    44802       -1     
  Misses       4469     4469
Flag Coverage Δ
#multiple 88.69% <100%> (-0.01%) ⬇️
#single 40.22% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.43% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0281886...0a831b6. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 12, 2017

Codecov Report

Merging #16675 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16675      +/-   ##
==========================================
- Coverage   90.92%   90.92%   -0.01%     
==========================================
  Files         161      161              
  Lines       49272    49271       -1     
==========================================
- Hits        44803    44802       -1     
  Misses       4469     4469
Flag Coverage Δ
#multiple 88.69% <100%> (-0.01%) ⬇️
#single 40.22% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.43% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0281886...0a831b6. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 12, 2017

Codecov Report

Merging #16675 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16675      +/-   ##
==========================================
- Coverage   90.92%   90.92%   -0.01%     
==========================================
  Files         161      161              
  Lines       49272    49276       +4     
==========================================
+ Hits        44803    44804       +1     
- Misses       4469     4472       +3
Flag Coverage Δ
#multiple 88.68% <ø> (-0.01%) ⬇️
#single 40.18% <ø> (-0.04%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 95.23% <0%> (-0.1%) ⬇️
pandas/plotting/_core.py 81.83% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0281886...28ec409. Read the comment docs.

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2017

@dgwynne : LGTM, well done re-patching your patch 😄

@dgwynne
Copy link
Contributor Author

dgwynne commented Jun 12, 2017

@gfyoung : thanks. sorry for the breakage though. it never even occurred to me that there'd be encoding issues like this.

it's also encouraging to see how quickly such issues are handled. reverting the diff was obviously right at the time.

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2017

@dgwynne : No worries! We can't see everything that breaks if we don't test for it. ;)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approach lgtm. couple of comments.

@@ -54,7 +54,6 @@ Indexing
I/O
^^^

- Bug in ``pd.read_csv()`` in which files containing EOF characters mid-field could fail with the C engine on Windows (:issue:`16039`, :issue:`16559`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-add this (and add this PR number on as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

fs->initial_file_pos = ftell(fs->fp);
fs->fd = open(fname, O_RDONLY | O_BINARY);
if (fs->fd == -1) {
goto err_free;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would in-line these instead of goto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not keen on replacing the goto with duplicate code in all the error paths. in my experience using goto in this specific situation, ie, unwinding the generation of state like this, has more benefits than caveats. inlining the cleanup leads to larger code, and is more error prone to maintain. if you add the allocation of something else in the future, you have to judge and fix the cleanup in multiple places rather than adding to the stack of goto targets.

if you feel strongly about it ill change it. after all, im just visiting :). i just wanted to discuss why i did it like this in the first place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am fine either way

but right now it's about half and half

so if u can remove some calls and use the goto that's fine (or go the other way imho is actually more explicit)

code size makes almost no difference here
correct is the important issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so inline?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

if (mm->fd == -1) {
fprintf(stderr, "new_file_buffer: open(%s) failed. errno =%d\n",
fname, errno);
goto err_free;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

if (fstat(mm->fd, &stat) == -1) {
fprintf(stderr, "new_file_buffer: fstat() failed. errno =%d\n",
errno);
goto err_close;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

/* XXX Eventually remove this print statement. */
fprintf(stderr, "new_file_buffer: mmap() failed.\n");
free(mm);
mm = NULL;
goto err_close;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

@jreback jreback added the IO CSV read_csv, to_csv label Jun 12, 2017
@jreback jreback added this to the 0.20.3 milestone Jun 12, 2017
@jreback jreback merged commit d298414 into pandas-dev:master Jun 13, 2017
@jreback
Copy link
Contributor

jreback commented Jun 13, 2017

thanks @dgwynne !

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jul 6, 2017
…as-dev#16675)

* Revert "BUG: Revert pandas-devgh-16039 (pandas-dev#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in pandas-dev#16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on pandas-dev#16675

(cherry picked from commit d298414)
TomAugspurger pushed a commit that referenced this pull request Jul 7, 2017
* Revert "BUG: Revert gh-16039 (#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in #16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on #16675

(cherry picked from commit d298414)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants